Abstract

What is information? Is it physical? We argue that in a Bayesian 
theory the notion of information must be defined in terms of its effects on 
the beliefs of rational agents. Information is whatever constrains rational 
beliefs and therefore it is the force that induces us to change our minds. 
This problem of updating from a prior to a posterior probability distribu- 
tion is tackled through an eliminative induction process that singles out 
the logarithmic relative entropy as the unique tool for inference. The re- 
sulting method of Maximum relative Entropy (ME), which is designed for 
updating from arbitrary priors given information in the form of arbitrary 
constraints, includes as special cases both MaxEnt (which allows arbi- 
trary constraints) and Bayes' rule (which allows arbitrary priors). Thus, 
ME unifies the two themes of these workshops - the Maximum Entropy 
and the Bayesian methods - into a single general inference scheme that 
allows us to handle problems that lie beyond the reach of either of the 
two methods separately. I conclude with a couple of simple illustrative 
examples. 



1 Introduction 

The general problem of inductive inference is to update from a prior probability 
distribution to a posterior distribution when new information becomes available. 
This raises several basic questions which are the subject of this paper. First, 
what is information? It is clear that data "contains" or "conveys" information, 
but what does this precisely mean? Is information some sort of physical fluid 
that can be contained or transported? Is information physical? Can we measure
amounts of information? Do we need to? What is entropy? 

A second set of questions revolves around our methods to process informa- 
tion. We know that Bayes' rule is the natural way to update probabilities when 
the new information is in the form of data and we know that Jaynes' method 
of maximum entropy, MaxEnt, is designed to handle information in the form 
of constraints. At first sight these two methods appear unrelated. Are they
compatible with each other? Are there other methods? Moreover, the range
of applicability of either method is somewhat limited: Bayes' rule can handle
arbitrary priors and data, and it can even handle some constraints, but not
arbitrary constraints. On the other hand, MaxEnt can handle arbitrary constraints
and even data, but not arbitrary priors. Can we extend these methods?

As discussed in [5] the Shannon-Jaynes interpretation of entropy as a measure
of uncertainty or of amount of information is somewhat problematic. The
issue is not purely academic because the way equations are set up to solve a 
problem and even the kind of problems that we are willing to consider are af- 
fected by the particular meaning attributed to quantities such as entropy or 
probability. The Shannon-Jaynes interpretation was fairly adequate for their 
purposes, namely, communication theory and statistical mechanics, but it is 
not at all clear that their entropy with its attendant interpretation was the 
appropriate tool for the very different problem of updating probabilities. 

The important contribution of Shore and Johnson [3] was the realization that 
any confusion surrounding the meaning of entropy could be, if not resolved, at 
least evaded by directly axiomatizing the procedure for updating probabilities 
instead of seeking dubious measures for a vaguely defined notion of information. 
Their argument, which is based on demanding consistency - if a problem can 
be solved in two different ways the two solutions must agree - is fundamen- 
tally sound. However, the detailed assumptions in their derivation have been 
criticized in [4], [5].

Another approach to entropy was proposed by Skilling [6]. Although his axioms
were clearly inspired by Shore and Johnson, the method was very different
in two respects. First, Skilling was not directly concerned with the problem 
of updating probabilities; his method was designed for the determination of 
positive-additive functions such as intensities in an image. In retrospect we see 
that the application to this particular problem was quite unfortunate because 
when the method failed to produce good image reconstructions the natural re- 
action was a widespread loss of confidence about entropy methods in general. 

The second difference, which I think is a truly significant contribution, is 
that Skilling's approach is a systematic method for induction. He spelled out 
in full detail how to construct a general theory from known special cases. The 
fundamental inductive principle is deceptively trivial: 'If a general theory exists
it must apply to special cases'. The basic idea is that when there exists a special
case that happens to be known all candidate theories that fail to reproduce it 
must be discarded. Thus, the known special cases - called the axioms of the 
theory - constrain the form of the general theory, and the idea is that a sufficient 
number of such constraints will determine the general theory completely. Of 
course, there is always the unfortunate possibility that the desired general theory 
does not exist, but if it does, then the search can be conducted in a systematic 
and orderly way. 

Philosophers already had a name for such a method: they called it eliminative
induction [7]. On the negative side, eliminative induction, like any other
form of induction, is not guaranteed to work. It failed, for example, in Skilling's 
image reconstruction problem. On the positive side, eliminative induction adds
an interesting twist to Popper's scientific methodology. According to Popper
scientific theories can never be proved right, they can only be proved false; a 
theory is corroborated only to the extent that all attempts at falsifying it have 
failed. Eliminative induction is fully compatible with Popper's notions but the 
point of view is just the opposite. Instead of focusing on failure to falsify one 
focuses on success: it is the successful falsification of all rival theories that cor- 
roborates the surviving one. The advantage is that one acquires a more explicit 
understanding of why competing theories are eliminated. 

The present paper is the third in a sequence devoted to clarifying the use of 
relative entropy as a tool for processing information and updating probabilities 
[2], [5]. In [2] we applied Skilling's method to the problem of Shore and Johnson.
The answer to the question 'What is entropy?' turns out to be trivial and 
somewhat surprising: entropy needs no interpretation. We do not need to know 
what 'entropy' means, we only need to know how to use it. This explains why
the "correct" interpretation had been so elusive - there is none. In [2] and then
again in [8] the special cases, the axioms, were increasingly polished to clarify
how alternative entropies are ruled out. Furthermore, in [2] we also discussed 
the question, central to any general method of updating, of the extent to which 
the distribution of maximum entropy is to be preferred over all others, the extent 
to which distributions with entropies less than the maximum are to be ruled 
out. 

In this paper we review how eliminative induction leads to a unique candi- 
date for a general theory of inference, the method of Maximum relative Entropy 
(ME), which is designed for updating from arbitrary priors given information in 
the form of arbitrary constraints. The three axioms used in [8] - locality,
coordinate invariance, and consistency for independent subsystems - are sufficient
to single out the logarithmic relative entropy as the unique tool for updating.
In particular, we wish to elaborate further on the use of the third axiom -
consistency for independent subsystems - to eliminate alternative entropies [12].

The idea is rather simple. The known special cases covered under axiom 
3 also include situations in which we have a large number N of independent 
identical systems where all sorts of inferences can be reliably carried out using 
various asymptotic techniques (laws of large numbers, large deviation theory, 
etc.). The close connection with the method of maximum entropy has been 
repeatedly emphasized by several authors [9]-[11]. We conclude that the logarithmic
relative entropy is the only candidate for a general method for updating
probabilities. Alternative entropies can be useful for other purposes - for ex- 
ample, when studying the information geometry of statistical manifolds - but 
not for a general theory of updating. 

In [5] we showed that the ME method includes both MaxEnt and Bayes' 
rule as special cases and therefore it unifies the two dominant themes of these 
workshops - the Maximum Entropy and Bayesian methods - into a single general 
inference scheme that allows us to handle problems that lie beyond the reach 
of either of the two methods separately. I conclude with a couple of simple 
illustrative examples. 

In a companion paper [13] we discuss the problem of multiple constraints.
Should the constraints be processed simultaneously or sequentially and, if sequentially,
in what order? There we also give an explicit example in which ME is used to 
simultaneously process information in the form of data and moment constraints. 

2 What is information? 

It is not unusual these days to hear that systems "carry" or "contain" information
and that "information is physical". This mode of expression can perhaps
be traced to the origins of information theory in Shannon's theory of communi- 
cation. We say that we have received information when among the vast variety 
of messages that could conceivably have been generated by a distant source, we 
discover which particular message was actually sent. It is thus that the message 
"carries" information. The analogy with physics is straightforward: the set of 
all possible states of a physical system can be likened to the set of all possible 
messages, and the actual state of the system corresponds to the message that 
was actually sent. Thus, the system "conveys" a message: the system "carries" 
information about its own state. Sometimes the message might be difficult to 
read, but it is there nonetheless. 

This language - information is physical - useful as it has turned out to be, 
does not exhaust the meaning of the word 'information'. The goal of informa- 
tion theory, or better, communication theory, is to characterize the sources of 
information, to measure the capacity of the communication channels, and to 
learn how to control the degrading effects of noise. It is somewhat ironic but 
nevertheless true that this "information" theory is unconcerned with the cen- 
tral Bayesian issue of how the message affects the beliefs of a rational agent. A 
fully Bayesian information theory demands an explicit account of the relation 
between information and beliefs. 

Our desire to update from one state of belief to another is driven by the 
conviction that not all probability assignments are equally good. One can argue
that what makes one probability assignment better than another is that it better 
reflects some objective feature of the world, that it provides a better guide to 
the "truth" - whatever this might mean. The updating mechanism is supposed 
to allow us to incorporate information about the world into our beliefs. 

The implication is that when confronted with new information our choices 
as to what we are honestly and rationally allowed to believe should become 
correspondingly restricted. This, I propose, is the defining characteristic of 
information: Information is whatever constrains rational beliefs. An important 
aspect of this notion is that for a rational agent the updating is not optional; 
it is a moral imperative. Information is whatever forces a change of rational 
beliefs. 

Our definition captures an idea of information that is directly related to 
changing our minds: information is the driving force behind the process of 
learning. Note also that although there is no need to talk about amounts of 
information, whether measured in units of bits or otherwise, our notion of information
allows precise quantitative calculations. Indeed, by information in its
most general form, we mean the set of constraints on the family of acceptable
posterior distributions and this is precisely the kind of information the method 
of maximum entropy has been designed to handle. 

It may be worthwhile to point out an analogy with Newtonian dynamics. 
The state of motion of a system is described in terms of momentum - the 
"quantity" of motion - while the change from one state to another is explained 
in terms of an applied force. Similarly, in Bayesian inference a state of belief is 
described in terms of probabilities - the "quantity" of belief - and the change 
from one state to another is due to information. Just as a force is defined as 
that which induces a change in motion, so information is that which induces a 
change of beliefs. 

3 Updating probabilities: the ME method 

Consider a variable x which can be discrete or continuous, in one or several 
dimensions. The uncertainty about x is described by a probability distribution 
q(x). Our goal is to update from the prior distribution q(x) to a posterior
distribution P(x) when new information - that is, constraints - becomes available.
The constraints could be given in terms of expected values but this is not
necessary. The question is: of all those distributions p(x) within the family defined
by the constraints, which do we select?

As suggested by Skilling [5] to select the posterior it seems reasonable to 
rank the candidate distributions in order of increasing preference. It is clear 
that to accomplish this goal the ranking must be transitive: if distribution $p_1$ is
preferred over distribution $p_2$ and $p_2$ is preferred over $p_3$, then $p_1$ is preferred
over $p_3$. Such transitive rankings are represented by assigning to each p(x) a
real number S[p], which we will henceforth call entropy, in such a way that if
$p_1$ is preferred over $p_2$, then $S[p_1] > S[p_2]$. The selected distribution P (one or
possibly many, for on the basis of the available information there may be several 
equally preferred distributions) will be that which maximizes the entropy S[p]. 
We are thus led to a method of Maximum Entropy (ME) that is a variational 
method involving entropies which are real numbers. These features are imposed 
on purpose; they are dictated by the function that the ME method is designed 
to perform. 

Next, to define the ranking scheme, we must decide on the functional form 
of S[p]. First, the purpose of the method is to update from priors to posteri- 
ors. The ranking scheme must depend on the particular prior q and therefore 
the entropy S must be a functional of both p and q. Thus the entropy S[p, q] 
produces a ranking of the distributions p relative to the given prior q: S[p, q] 
is the entropy of p relative to q. Accordingly S[p, q] is commonly called rela- 
tive entropy. Since all entropies are relative, even when relative to a uniform 
distribution, the modifier 'relative' is redundant and will be dropped. 

Second, since we deal with incomplete information the method, by its very 
nature, cannot be deductive: the method must be inductive. The best we can do 
is use those special cases where we know what the preferred distribution should
be to eliminate those entropy functionals S[p, q] that fail to provide the right
update. The known special cases will be called (perhaps inappropriately) the 
axioms of the theory. They play a crucial role: they define what makes one 
distribution preferable over another. 

The three axioms below are chosen to reflect the conviction that information 
collected in the past and codified into the prior distribution is very valuable and 
should not be frivolously discarded. This attitude is maximally conservative: 
the only aspects of one's beliefs that should be updated are those for which 
new evidence has been supplied. Furthermore, since the axioms do not tell us 
what and how to update, they merely tell us what not to update, they have 
the added bonus of maximizing objectivity - there are many ways to change 
something but only one way to keep it the same. Thus, we adopt the 

Principle of Minimal Updating (PMU): Beliefs should be updated only 
to the extent required by the new information. 

The three axioms, a brief motivation for them, and their consequences for the 
functional form of the entropy are listed below; more details and proofs are 
given in [5] and [5j. As will become immediately apparent the axioms do not 
refer to merely three cases; any induction from such a weak foundation would 
hardly be reliable. The reason the axioms are convincing and so constraining is 
that they refer to three infinitely large classes of known special cases. 

Axiom 1: Locality. Local information has local effects. 
Suppose the information to be processed does not refer to a particular subdomain
$\mathcal{D}$ of the space $\mathcal{X}$ of x's. In the absence of any new information about
$\mathcal{D}$ the PMU demands we do not change our minds about $\mathcal{D}$. Thus, we design
the inference method so that $q(x|\mathcal{D})$, the prior probability of x conditional on
$x \in \mathcal{D}$, is not updated. The selected conditional posterior is
$P(x|\mathcal{D}) = q(x|\mathcal{D})$.
The consequence of axiom 1 is that non-overlapping domains of x contribute
additively to the entropy. Dropping additive terms and multiplicative factors
that do not affect the overall ranking, the entropy functional can be simplified
to the form

$$ S[p,q] = \int dx\, F\big(p(x), q(x), x\big) , \tag{1} $$

where F is some unknown function. 

Axiom 2: Coordinate invariance. The system of coordinates carries no 
information. 

The points x can be labeled using any of a variety of coordinate systems. One 
can always change coordinates but this should not affect the ranking of the 
distributions. The consequence of axiom 2 is that S[p, q] can be written in
terms of coordinate invariants such as $dx\, m(x)$, $p(x)/m(x)$, and $q(x)/m(x)$:

$$ S[p,q] = \int dx\, m(x)\, \Phi\!\left( \frac{p(x)}{m(x)}, \frac{q(x)}{m(x)} \right) . \tag{2} $$

(Again, additive terms and multiplicative factors that do not affect the overall
ranking have been dropped.) Thus the unknown function F which had three
arguments has been replaced by two unknown functions, one is a density m(x),
and the other is a function $\Phi$ with two arguments. Next we determine the
density m(x) by invoking the locality axiom 1 once again.

Axiom 1 (special case): When there is no new information there is no 
reason to change one's mind. 

When no new information is available the domain $\mathcal{D}$ in axiom 1 coincides with
the whole space $\mathcal{X}$. The conditional probabilities
$q(x|\mathcal{D}) = q(x|\mathcal{X}) = q(x)$
should not be updated and the selected posterior distribution coincides with
the prior, P(x) = q(x). The consequence is that up to normalization m(x) must
be the prior distribution q(x), which restricts the entropy to functionals of the
form

$$ S[p,q] = \int dx\, q(x)\, \Phi\!\left( \frac{p(x)}{q(x)} \right) . \tag{3} $$

Axiom 3: Consistency for independent subsystems. When a system 
is composed of subsystems that are known to be independent it should not matter 
whether the inference procedure treats them separately or jointly. 

Suppose the information on two independent subsystems 1 and 2 is such that
the prior distributions $q_1(x_1)$ and $q_2(x_2)$ are respectively updated to
$P_1(x_1)$ and $P_2(x_2)$ when they are treated separately. When treated as a single
system the joint prior is $q_1(x_1)q_2(x_2)$ and the family of potential posteriors is
$p(x_1,x_2) = p_1(x_1)p_2(x_2)$. The entropy functional must be such that the selected
posterior is $P_1(x_1)P_2(x_2)$. The consequence of axiom 3 for this particular case of just two
subsystems is that entropies are restricted to the one-parameter family given by 



$$ S_\eta[p,q] = \frac{1}{\eta(\eta+1)} \left[ 1 - \int dx\, p(x) \left( \frac{p(x)}{q(x)} \right)^{\eta} \right] . \tag{4} $$



Once again, additive terms and multiplicative factors that do not affect the 
overall ranking scheme can be freely chosen. The $\eta = 0$ case reproduces the
usual logarithmic relative entropy,

$$ S[p,q] = -\int dx\, p(x) \log \frac{p(x)}{q(x)} . \tag{5} $$

[Use $y^\eta = \exp(\eta \log y) \approx 1 + \eta \log y$ in eq. (4) and let $\eta \to 0$ to get eq. (5).]
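
The limit can also be checked numerically. The following Python sketch is an illustration only: it assumes a discrete analogue of the one-parameter family, S_eta = (1 - sum_i p_i (p_i/q_i)^eta) / (eta (eta + 1)), and the function names and the two example distributions are hypothetical choices, not taken from the text.

```python
import math

def s_eta(p, q, eta):
    """Discrete analogue (assumed form) of the one-parameter family:
    S_eta = (1 - sum_i p_i (p_i/q_i)**eta) / (eta * (eta + 1))."""
    total = sum(pi * (pi / qi) ** eta for pi, qi in zip(p, q) if pi > 0)
    return (1.0 - total) / (eta * (eta + 1.0))

def s_log(p, q):
    """Logarithmic relative entropy, S = -sum_i p_i log(p_i/q_i)."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # hypothetical candidate posterior
q = [0.2, 0.3, 0.5]  # hypothetical prior

# As eta shrinks, S_eta should approach the logarithmic entropy.
for eta in (1e-1, 1e-2, 1e-3, 1e-4):
    print(eta, s_eta(p, q, eta), s_log(p, q))
```

The printed pairs approach each other as eta decreases, which is the behavior expected from the first-order expansion of the power in eta.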

In [8] we argued that the index $\eta$ has to be the same for all systems. To see
why consider any two independent systems characterized by $\eta_1$ and $\eta_2$.
Consistency between the joint and separate updates requires that $\eta_1 = \eta_2$; therefore
$\eta$ must be a universal constant. From the success of statistical mechanics as a
theory of inference we inferred that the value of this constant must be $\eta = 0$,
leading to the logarithmic entropy, eq. (5). Here we offer a different argument
also based on a broader application of axiom 3: 

Axiom 3 (special case): Consistency for large numbers of indepen- 
dent identical subsystems. 

The known special cases covered under axiom 3 include situations in which we 
have a large number N of independent identical systems. In such cases either
the weak law of large numbers or large deviation theory in the form of Sanov's
theorem is sufficient to make the desired inferences. Entropy considerations
are not needed.

Let the x variables be discrete with $i = 1 \ldots m$. The identical priors for
the individual systems are $q_i$ and the available information is that the potential
posteriors $p_i$ are subject, for example, to an expectation value constraint such
as $\langle a \rangle = A$, where A is some specified value and $\langle a \rangle = \sum_i a_i p_i$.

Consider the set of N systems treated jointly. Let the number of systems
found in state i be $n_i$, and let $f_i = n_i/N$ be the corresponding frequency. In
the limit of large N the frequencies $f_i$ converge (in probability) to the desired
posterior $P_i$ while the sample average $\bar{a} = \sum_i a_i f_i$ converges (also in
probability) to the expected value $\langle a \rangle = A$. The probability of a particular
frequency distribution $f = \{f_1 \ldots f_m\}$ generated by the prior q is multinomial,

$$ Q_N(f|q) = \frac{N!}{n_1! \cdots n_m!}\, q_1^{n_1} \cdots q_m^{n_m} \quad \text{with} \quad \sum_{i=1}^{m} n_i = N , \tag{6} $$

and for large N we have

$$ Q_N(f|q) \approx \exp N \big( S[f,q] + r_N \big) , \tag{7} $$

where S[f, q] is given by eq. (5), and where $r_N$ is a correction that vanishes as
$N \to \infty$. To find the most probable frequency distribution satisfying the
constraint $\bar{a} = A$ one maximizes (7) subject to $\bar{a} = A$, which is equivalent to
maximizing the entropy S[f, q] subject to $\bar{a} = A$. The corresponding problem
for the individual systems is that of maximizing $S_\eta[p, q]$ subject to $\langle a \rangle = A$. The
two procedures agree only when we choose $\eta = 0$. Therefore, entropies $S_\eta$ with
$\eta \neq 0$ are not consistent with the laws of large numbers and must be discarded.
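
The asymptotic statement above, that (1/N) log Q_N(f|q) approaches the logarithmic entropy S[f, q] for large N, can be illustrated with a short Python sketch. The helper names, the prior, and the frequencies below are hypothetical choices made for the illustration; the multinomial coefficient is evaluated through `math.lgamma` to avoid overflow.

```python
import math

def log_multinomial(counts, q):
    """Log of the multinomial probability of the given counts under prior q."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coeff + sum(c * math.log(qi) for c, qi in zip(counts, q) if c > 0)

def s_log(f, q):
    """Logarithmic relative entropy, S = -sum_i f_i log(f_i/q_i)."""
    return -sum(fi * math.log(fi / qi) for fi, qi in zip(f, q) if fi > 0)

def rate_gap(k, base_counts, q):
    """|(1/N) log Q_N - S[f, q]| for counts k * base_counts (N = k * sum)."""
    counts = [k * c for c in base_counts]
    n = sum(counts)
    f = [c / n for c in counts]
    return abs(log_multinomial(counts, q) / n - s_log(f, q))

q = [0.2, 0.3, 0.5]      # hypothetical prior
base_counts = [5, 3, 2]  # fixes the frequencies f = (0.5, 0.3, 0.2)

# The gap plays the role of the correction r_N and shrinks as N grows.
for k in (1, 10, 100, 1000, 10000):
    print(10 * k, rate_gap(k, base_counts, q))
```

Keeping the frequencies fixed while scaling the counts isolates the N dependence of the correction term.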

Csiszar [10] and Grendar [11] have argued that the asymptotic argument
above provides a valid justification for the ME method of updating. An agent 
whose prior is q receives the information (a) = A which can be reasonably 
interpreted as a sample average a = A over a large ensemble of N trials. The 
agent's beliefs are updated so that the posterior P coincides with the most 
probable / distribution. This is quite compelling but, of course, as a justification 
of the ME method it is restricted to situations where it is natural to think in 
terms of ensembles with large N. This justification is not nearly as compelling 
for singular events for which large ensembles either do not exist or are too 
unnatural and contrived. From our point of view the asymptotic argument 
above does not by itself provide a fully convincing justification for the universal 
validity of the ME method but it does provide considerable inductive support. 
It serves as a valuable consistency check that must be passed by any inductive 
inference procedure that claims to be of general applicability. 

The results are summarized as follows: 
The ME method: The objective is to update from a prior distribution q to 
a posterior distribution given the information that the posterior lies within a 
certain family of distributions p. The selected posterior P(x) is that which
maximizes the entropy S[p, q]. Since prior information is valuable the functional
S[p, q] has been chosen so that beliefs are updated only to the extent required by
the new information. No interpretation for S[p, q] is given and none is needed. 
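
As a concrete instance of the method just summarized, consider a discrete variable with prior q_i and a single expectation constraint sum_i a_i p_i = A. Maximizing the logarithmic entropy then gives the standard exponential form p_i = q_i exp(lambda a_i) / Z. The sketch below is a minimal illustration under these assumptions; the function names, the bisection bracket, and the die example are hypothetical and not part of the text.

```python
import math

def tilted(q, a, lam):
    """Distribution p_i proportional to q_i * exp(lam * a_i)."""
    w = [qi * math.exp(lam * ai) for qi, ai in zip(q, a)]
    z = sum(w)
    return [wi / z for wi in w]

def me_update(q, a, target, lo=-50.0, hi=50.0, tol=1e-12):
    """ME posterior for prior q under the constraint sum_i a_i p_i = target.

    The multiplier is found by bisection; the constrained mean is a
    nondecreasing function of lam, so the bracket [lo, hi] is assumed
    wide enough to contain the solution.
    """
    def mean(lam):
        return sum(pi * ai for pi, ai in zip(tilted(q, a, lam), a))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return tilted(q, a, 0.5 * (lo + hi))

# Hypothetical example: a die with uniform prior and a constrained mean.
faces = [1, 2, 3, 4, 5, 6]
prior = [1.0 / 6.0] * 6
posterior = me_update(prior, faces, 4.5)
print(posterior)
```

If the target equals the prior mean (3.5 in this example) the multiplier vanishes and the posterior coincides with the prior, which is the discrete counterpart of updating only to the extent required by the new information.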

4 Bayes' rule and its generalizations



The problem is to update our beliefs about $\theta \in \Theta$ ($\theta$ represents one or many
parameters) on the basis of three pieces of information: (1) the prior information
codified into a prior distribution $q(\theta)$; (2) the data $x \in \mathcal{X}$ (obtained in one or
many experiments); and (3) the known relation between $\theta$ and x given by the
model as defined by the sampling distribution or likelihood, $q(x|\theta)$. The updating
consists of replacing the prior probability distribution $q(\theta)$ by a posterior
distribution $P(\theta)$ that applies after the data has been processed.

The crucial element that will allow Bayes' rule to be smoothly incorporated
into the ME scheme is the realization that before the data information is available
not only do we not know $\theta$, we do not know x either. Thus, the relevant
space for inference is not $\Theta$ but the product space $\Theta \times \mathcal{X}$ and the relevant joint
prior is $q(x,\theta) = q(\theta)q(x|\theta)$. We should emphasize that the information about
how x is related to $\theta$ is contained in the functional form of the distribution
$q(x|\theta)$ - for example, whether it is a Gaussian or a Cauchy distribution - and
not in the actual values of the arguments x and $\theta$ which are, at this point, still
unknown.

Next we collect data and the observed values turn out to be x'. We must
update to a posterior that lies within the family of distributions $p(x, \theta)$ that
reflect the fact that x is known,

$$ p(x) = \int d\theta\, p(\theta, x) = \delta(x - x') . \tag{8} $$

This data information constrains but is not sufficient to determine the joint
distribution

$$ p(x, \theta) = p(x)\, p(\theta|x) = \delta(x - x')\, p(\theta|x') . \tag{9} $$

Any choice of $p(\theta|x')$ is in principle possible. Additional input is needed and
it is at this point that we invoke the Principle of Minimal Updating: beliefs
need to be revised only to the extent required by the data. Accordingly the
conditional prior $q(\theta|x')$ requires no revision and the selected posterior $P(x,\theta)$
is such that $P(\theta|x') = q(\theta|x')$, or

$$ P(x, \theta) = \delta(x - x')\, q(\theta|x') . \tag{10} $$

The corresponding marginal posterior probability $P(\theta)$ is

$$ P(\theta) = \int dx\, P(\theta, x) = q(\theta|x') = q(\theta) \frac{q(x'|\theta)}{q(x')} , \tag{11} $$

which is recognized as Bayes' rule. This is extremely reasonable: we maintain
those beliefs about $\theta$ that are consistent with the data values x' that turned
out to be true. Data values that were not observed are discarded because they
are now known to be false. 'Maintain' is the key word: it reflects the PMU in
action.
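
The construction above (build the joint prior, fix x to its observed value, keep the conditional prior of the parameter unchanged, then marginalize) can be sketched for discrete variables. The two-hypothesis coin below, with its labels and numbers, is a hypothetical example; the function names are likewise illustrative.

```python
def joint_prior(prior, likelihood):
    """Joint prior q(x, theta) = q(theta) q(x|theta), keyed by (x, theta)."""
    return {(x, t): prior[t] * likelihood[t][x]
            for t in prior for x in likelihood[t]}

def update_on_data(joint, x_obs):
    """Joint posterior: all weight on x = x_obs, with q(theta|x_obs) kept as is."""
    q_x = sum(v for (x, _), v in joint.items() if x == x_obs)
    return {(x, t): (v / q_x if x == x_obs else 0.0)
            for (x, t), v in joint.items()}

def marginal_theta(joint):
    """Marginal over x of a joint distribution keyed by (x, theta)."""
    out = {}
    for (_, t), v in joint.items():
        out[t] = out.get(t, 0.0) + v
    return out

# Hypothetical example: is a coin fair or biased towards heads?
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": {"H": 0.5, "T": 0.5},
              "biased": {"H": 0.9, "T": 0.1}}

posterior = marginal_theta(update_on_data(joint_prior(prior, likelihood), "H"))
print(posterior)
```

The marginal obtained this way agrees with the usual product q(theta) q(x'|theta) / q(x'), here 0.25/0.7 for the fair coin after observing heads.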

Remark: Bayes' rule is usually written in the form

$$ q(\theta|x') = q(\theta) \frac{q(x'|\theta)}{q(x')} , \tag{12} $$

and called Bayes' theorem. This formula is very simple; perhaps it is too simple.
It is just a restatement of the product rule - valid for any x' whether observed
or not - and therefore it is a simple consequence of the internal consistency of
the prior beliefs. The drawback of this formula is that the left hand side is not a
posterior but rather a prior (conditional) probability; it obscures the fact that
an additional principle - the PMU - was needed for updating.

Next we show that Bayes' rule is consistent with, and indeed, is a special
case of the ME method [8]. This is not too surprising given that the ME is also
based on the PMU. According to the ME method the selected joint posterior
$P(x, \theta)$ is that which maximizes the entropy,

$$ S[p, q] = -\int dx\, d\theta\, p(x, \theta) \log \frac{p(x, \theta)}{q(x, \theta)} , \tag{13} $$

subject to the appropriate constraints. Note that the information in the data,
eq. (8), represents an infinite number of constraints on the family $p(x,\theta)$: for each
value of x there is one constraint and one Lagrange multiplier $\lambda(x)$. Maximizing
S, eq. (13), subject to (8) and normalization,

$$ \delta \left\{ S + \alpha \left[ \int dx\, d\theta\, p(x, \theta) - 1 \right] + \int dx\, \lambda(x) \left[ \int d\theta\, p(x, \theta) - \delta(x - x') \right] \right\} = 0 , \tag{14} $$

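yields the joint posterior,

$$ P(x, \theta) = q(x, \theta) \frac{e^{\lambda(x)}}{Z} , \tag{15} $$

where Z is a normalization constant, and $\lambda(x)$ is determined from (8),

$$ \int d\theta\, q(x, \theta) \frac{e^{\lambda(x)}}{Z} = q(x) \frac{e^{\lambda(x)}}{Z} = \delta(x - x') , \tag{16} $$

so that the joint posterior is

$$ P(x, \theta) = q(x, \theta) \frac{\delta(x - x')}{q(x)} = \delta(x - x')\, q(\theta|x) , \tag{17} $$

from which we recover Bayes' rule, eq. (11).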

I conclude with a couple of very simple examples that show how the ME allows
generalizations of Bayes' rule. The background for these generalized Bayes
problems is the familiar one: We want to make inferences about some variables
$\theta$ on the basis of information about other variables x. As before, the prior
information consists of our prior knowledge about $\theta$ given by the distribution $q(\theta)$
and the relation between x and $\theta$ is given by the likelihood $q(x|\theta)$; thus, the
prior joint distribution $q(x, \theta)$ is known. But now the information about x is
much more limited.

Example 1. The data is uncertain: x is not known. The marginal posterior
p(x) is no longer a sharp delta function but some other known distribution,
$p(x) = P_D(x)$. This is still an infinite number of constraints

$$ p(x) = \int d\theta\, p(\theta, x) = P_D(x) , \tag{18} $$

that are easily handled by ME. Maximizing S, eq. (13), subject to (18) and
normalization, leads to

$$ P(x, \theta) = P_D(x)\, q(\theta|x) . \tag{19} $$

The corresponding marginal posterior,

$$ P(\theta) = \int dx\, P_D(x)\, q(\theta|x) = q(\theta) \int dx\, P_D(x) \frac{q(x|\theta)}{q(x)} , \tag{20} $$

is known as Jeffrey's rule.
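
A discrete version of Jeffrey's rule, in which the integral over x becomes a sum, is sketched below. The function name and the coin example are hypothetical; the sketch assumes every x with nonzero weight in the data distribution also has nonzero prior probability q(x).

```python
def jeffrey_update(prior, likelihood, p_data):
    """Discrete Jeffrey's rule: P(theta) = sum_x P_D(x) q(theta|x)."""
    posterior = {t: 0.0 for t in prior}
    for x, px in p_data.items():
        # Prior marginal q(x) = sum_theta q(theta) q(x|theta).
        q_x = sum(prior[t] * likelihood[t][x] for t in prior)
        for t in prior:
            # q(theta|x) follows from the product rule on the joint prior.
            posterior[t] += px * prior[t] * likelihood[t][x] / q_x
    return posterior

# Hypothetical example: the outcome of a coin toss is reported unreliably.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": {"H": 0.5, "T": 0.5},
              "biased": {"H": 0.9, "T": 0.1}}

uncertain = jeffrey_update(prior, likelihood, {"H": 0.8, "T": 0.2})
sharp = jeffrey_update(prior, likelihood, {"H": 1.0, "T": 0.0})
print(uncertain, sharp)
```

Two limiting cases are worth noting: a sharp data distribution reproduces Bayes' rule, and a data distribution equal to the prior marginal q(x) returns the prior unchanged.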

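Example 2. Now we have even less information: p(x) is not known. All we
know about p(x) is an expected value

$$ \langle f \rangle = \int dx\, p(x) f(x) = F . \tag{21} $$

Maximizing S, eq. (13), subject to (21) and normalization,

$$ \delta \left\{ S + \alpha \left[ \int dx\, d\theta\, p(x, \theta) - 1 \right] + \lambda \left[ \int dx\, d\theta\, p(x, \theta) f(x) - F \right] \right\} = 0 , \tag{22} $$

yields the joint posterior,

$$ P(x, \theta) = q(x, \theta) \frac{e^{\lambda f(x)}}{Z} , \tag{23} $$

where the normalization constant Z and the multiplier $\lambda$ are obtained from

$$ Z = \int dx\, q(x)\, e^{\lambda f(x)} \quad \text{and} \quad \frac{\partial \log Z}{\partial \lambda} = F . \tag{24} $$

The corresponding marginal posterior is

$$ P(\theta) = q(\theta) \int dx\, \frac{e^{\lambda f(x)}}{Z}\, q(x|\theta) . \tag{25} $$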
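
A discrete sketch of Example 2 follows. It assumes that x takes finitely many values, that the posterior marginal of x has the exponential form q(x) exp(lambda f(x)) / Z, and that the multiplier can be found by bisection inside a hypothetical bracket; the function name and the two-toss coin example are illustrative only.

```python
import math

def moment_update(prior, likelihood, f, target, lo=-50.0, hi=50.0):
    """Posterior over theta given only the expected value <f> = target."""
    xs = list(f)
    # Prior marginal q(x) = sum_theta q(theta) q(x|theta).
    q_x = {x: sum(prior[t] * likelihood[t][x] for t in prior) for x in xs}

    def marginal_x(lam):
        """Posterior marginal of x: q(x) exp(lam f(x)) / Z."""
        w = {x: q_x[x] * math.exp(lam * f[x]) for x in xs}
        z = sum(w.values())
        return {x: wx / z for x, wx in w.items()}

    def mean(lam):
        return sum(px * f[x] for x, px in marginal_x(lam).items())

    # Bisection on lam; the mean is nondecreasing in lam.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    p_x = marginal_x(0.5 * (lo + hi))
    # P(theta) = sum_x P(x) q(theta|x), with q(theta|x) left unchanged.
    return {t: sum(p_x[x] * prior[t] * likelihood[t][x] / q_x[x] for x in xs)
            for t in prior}

# Hypothetical example: number of heads in two tosses of a fair or biased coin.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": {0: 0.25, 1: 0.5, 2: 0.25},
              "biased": {0: 0.01, 1: 0.18, 2: 0.81}}
heads = {0: 0.0, 1: 1.0, 2: 2.0}

print(moment_update(prior, likelihood, heads, 1.8))
```

When the target equals the prior expected value (1.4 heads in this example) the multiplier vanishes and the prior over theta is returned unchanged.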

The two posteriors (20) and (25) are sufficiently intuitive that one could have
written them down directly without deploying the full machinery of the ME 
method, but they do serve to illustrate the essential compatibility of Bayesian 
and Maximum Entropy methods. A less trivial example is given in [13]. 



5 Conclusions 

Any Bayesian account of the notion of information cannot ignore the fact that 
Bayesians are concerned with the beliefs of rational agents. The relation be- 
tween information and beliefs must be clearly spelled out. The definition we 
have proposed - that information is that which constrains rational beliefs and 
therefore forces the agent to change its mind - is convenient for two reasons. 
First, it makes the information/belief relation very explicit, and second, the definition is
ideally suited for quantitative manipulation using the ME method. 

The other main conclusion is that the logarithmic relative entropy is the only
candidate for a general method for updating probabilities - the ME method -
which includes MaxEnt and Bayes' rule as special cases; it unifies them into a
single theory of inductive inference.

It is true that there exist many different ways to define measures of sepa- 
ration, or divergence between distributions and that these "entropies" can be 
useful in a wide variety of ways. In fact, it was precisely this wealth of possi- 
bilities that Shore and Johnson intended to avoid. These other "entropies" can 
be useful for other purposes but not for updating; at least not for an updating 
theory that strives to achieve universal applicability. Let us emphasize that the 
reason the ME method uses the logarithmic entropy as the tool for updating is 
not that this entropy has been shown to provide the correct measure of distance 
- there are many other such measures. We do not even claim that inferences 
on the basis of the ME method are guaranteed to be correct - this is induction; 
there are no guarantees. It is just that all alternative entropies are much worse 
because in known cases they give answers that are demonstrably wrong. 

Acknowledgements: I would like to acknowledge valuable discussions with C.
Cafaro, N. Caticha, A. Giffin, K. Knuth, and C. Rodríguez.